Add ThreadSanitizer CI jobs (Linux + macOS)#148
Conversation
There was a problem hiding this comment.
Pull request overview
Adds ThreadSanitizer (TSan) support to the build and CI to detect data races at runtime, integrating it alongside existing sanitizer infrastructure.
Changes:
- Added
ENABLE_THREAD_SANITIZERCMake option and wiring to apply-fsanitize=threadcompile/link flags (with mutual exclusion vsENABLE_SANITIZERS). - Extended the Linux CI template to accept an
enableThreadSanitizerparameter and pass it through to CMake. - Added a new Azure Pipelines Linux/Clang TSan job.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
CMakeLists.txt |
Adds a dedicated CMake toggle for ThreadSanitizer and applies the required toolchain flags. |
.gitignore |
Ignores the UnitTests dist/ output directory. |
.github/jobs/linux.yml |
Adds a pipeline parameter/variable to enable TSan and passes it into the CMake configure step. |
.github/azure-pipelines.yml |
Introduces a new Ubuntu/Clang ThreadSanitizer CI job using the Linux job template. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add ENABLE_THREAD_SANITIZER CMake option for detecting data races at runtime. TSan cannot be combined with ASan, so it's a separate option with a mutual exclusion check. Adds an Ubuntu_ThreadSanitizer_clang CI job that runs the full test suite under TSan. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MSVC does not support ThreadSanitizer. Added macOS CI job and enableThreadSanitizer parameter to macos.yml. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ting arcana.cpp: bghgary/arcana.cpp fix/tsan-callback-race UrlLib: bghgary/UrlLib fix/tsan-websocket-race Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add tsan_suppressions.txt to suppress JavaScriptCore/WTF/bmalloc internal data races on Ubuntu (not fixable by us) - Wire suppressions into linux.yml for TSan jobs via TSAN_OPTIONS - Point arcana.cpp at upstream (microsoft/arcana.cpp#61 merged) - Update UrlLib fork SHA (post-rebase, pending BabylonJS/UrlLib#27) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
afa00dc to
2df3c81
Compare
The JSC races manifest in TSan's allocator interceptors (free/malloc/memcpy) called from JSC's JIT worker threads. Function-name suppressions (race:WTF::) don't match because the top frame is a libc function. Use called_from_lib to suppress any race with libjavascriptcoregtk in the call stack. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TSan instrumentation slows tests significantly. The 15-minute default caused the Ubuntu_TSan job to be canceled mid-test. Tests were passing but ran out of time. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Inline if/else/endif directives aren't allowed when embedded in a string value. Use block-level conditional insertion instead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- race:free, race:close, race:memcpy for TSan interceptors where the top frame is attributed to UnitTests, not libjavascriptcoregtk - signal:libjavascriptcoregtk for JSC signal handler errno issues Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switches UrlLib source from bghgary/UrlLib fork back to BabylonJS/UrlLib now that the Apple WebSocket data race fixes are merged upstream (BabylonJS/UrlLib#27). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…SED (#160) [Created by Copilot on behalf of @bghgary] ## Problem The WebSocket polyfill throws JavaScript errors in two places where the [WHATWG WebSocket specification](https://websockets.spec.whatwg.org/) requires different behavior: ### `close()` throws when already closing/closed `WebSocket::Close` throws `Error: Close has already been called.` when invoked on a socket whose `readyState` is already `CLOSING` or `CLOSED`. Per the spec, `close()` in those states must be a silent no-op. ### `send()` throws when closing/closed `WebSocket::Send` throws `Error: Websocket readyState is not open.` on any non-OPEN state. Per the spec, `send()` should only throw `InvalidStateError` when `readyState` is `CONNECTING`. When `CLOSING` or `CLOSED`, the data is silently discarded (the spec also bumps `bufferedAmount`, which this polyfill does not track). Because the throws are synchronous from a JS call, a delayed or racing call from an async callback surfaces as an **uncaught** exception that terminates the JS runtime. ## Repro path Observed on the macOS_Xcode164_Sanitizers leg of #148. Sequence (from the existing multi-WebSocket test in `Tests/UnitTests/Scripts/tests.ts`): 1. Test calls `ws.close()` on an open socket. 2. The native side transitions `readyState` -> `Closing` and issues the close. 3. Before the `onclose` callback marshals to the JS thread, another path (a pending `onmessage` handler, or a `catch` block on a failed `send`) calls `ws.close()` again. 4. The second `close()` finds `readyState == Closing` and **throws**, producing: ``` [Uncaught Error] close@[native code] @app:///Scripts/tests.js:27799:22 ``` The `send()` path is symmetric: a send scheduled between `close()` being called and the `onclose` callback firing would throw and escape uncaught. ## Fix ### `close()` Replace the `throw` with an early `return`. No other state is touched, so the first `close()` still drives the transition to `CLOSED` via the normal callback path. ### `send()` Split the single non-OPEN throw into two cases: throw on `CONNECTING`, return silently on `CLOSING`/`CLOSED`. ## Scope - Pure behavioral fix in the WebSocket polyfill. No changes outside `Polyfills/WebSocket/Source/WebSocket.cpp`. - No test changes — existing WebSocket tests already exercise these paths and will stop intermittently failing once this lands. - `bufferedAmount` tracking during the CLOSING/CLOSED send path is still not implemented; out of scope for this fix. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
[Posted by Copilot on behalf of @bghgary] Converting to draft while we investigate a Linux-specific hang in the SymptomThe
When TSan trips on a race early, the run fails in ~2 min. When it doesn't, the tests hang. As the race fixes land, the hang is no longer masked. Step timing on d28ca80
Build + configure are ~100s. The whole job cost is in the hung test run. Last log linesNo output for ~28 minutes after the last Why this is Linux-specificOn the same commit:
Same tests, same TSan. So it is specific to TSan + Linux, i.e. TSan instrumentation layered on top of libcurl and/or libjavascriptcoregtk-4.1 on glibc. macOS uses NSURLSession + macOS JSC; both pass. Hypotheses
Next steps
Keeping per-PR TSan is still worthwhile — this initiative has already surfaced multiple real races. The bottleneck here is a hang to fix, not TSan overhead: the full instrumented pipeline completes in under 3 min on macOS. |
The Ubuntu ThreadSanitizer job on Linux was hitting the 30-minute timeout ~45-75% of runs in a silent hang (no TSan report, no output progress). Reproduced locally in WSL Ubuntu 24.04 with the exact same packages as CI. Root cause ---------- JSC's concurrent garbage collector on Linux suspends each mutator thread at GC safepoints using a SIGUSR1-based protocol: Collector Thread Mutator Thread N ---------------- ---------------- pthread_kill(N, SIGUSR1) -> (signal handler runs, sem_post) sem_wait(sem) <- (handler returns) Under ThreadSanitizer, signal delivery is intercepted and serialized. When the mutator is inside an instrumented section, TSan defers the SIGUSR1 handler indefinitely. The Collector Thread's sem_wait then blocks forever, hanging the whole process. Confirmed with a gdb capture of a hung inferior: Thread 33 "ollector Thread": BabylonJS#5 __interceptor_sem_wait BabylonJS#6 WTF::Thread::suspend(WTF::ThreadSuspendLocker const&) BabylonJS#7-BabylonJS#21 [JSC GC stop-the-world path] Thread 3 "UnitTests": <pending SIGUSR1> (never delivered) macOS JSC uses Mach thread_suspend() rather than Unix signals, which is why the macOS TSan job has been passing in ~2.7 min the whole time. Fix --- Set JSC_useConcurrentGC=0 for the Ubuntu_ThreadSanitizer job only. This removes the dedicated Collector Thread; GC runs on the mutator without any cross-thread signaling. Also revert the 30-min timeout bump — with the hang fixed the Linux TSan job should finish in roughly the same time as macOS TSan (~3 min). Verification ------------ - Default (concurrent GC on) : 9-11 hangs per 20 runs - JSC_useConcurrentGC=0 : 0 hangs per 30 runs [Created by Copilot on behalf of @bghgary] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause identified — and fixed in 5c8a4a6Short version: JSC's concurrent GC on Linux uses SIGUSR1 + EvidenceReproduced locally in WSL Ubuntu 24.04 (same clang-18, cmake 3.28.3, libjavascriptcoregtk-4.1-dev 2.50.4 as CI). Default config: 9/20 hangs. Captured a hung inferior under gdb: The Collector was waiting on the post that Thread 3's SIGUSR1 handler was supposed to do; TSan never let the handler run. Fix verification
Changes in 5c8a4a6
Marking ready for review. Auto-merge is still disabled — leaving that decision to you after checking CI. [Created by Copilot on behalf of @bghgary] |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR BabylonJS#159 migrated the CI pipeline from Azure Pipelines to GitHub Actions and removed .github/azure-pipelines.yml and .github/jobs/*.yml. This merge resolves the modify/delete conflicts by accepting the deletions and re-applies the TSan work on the new workflow structure: - build-linux.yml: adds enable-thread-sanitizer input, wires ENABLE_THREAD_SANITIZER CMake flag, and sets TSAN_OPTIONS plus JSC_useConcurrentGC=0 env for the test run (the concurrent GC's SIGUSR1/sem_wait suspension deadlocks under TSan). - build-macos.yml: adds enable-thread-sanitizer input, wires ENABLE_THREAD_SANITIZER CMake flag, and sets TSAN_OPTIONS. - ci.yml: adds Ubuntu_ThreadSanitizer_clang and macOS_Xcode164_ThreadSanitizer jobs. CMakeLists.txt and .github/tsan_suppressions.txt changes from this branch are preserved unchanged. [Created by Copilot on behalf of @bghgary] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add ThreadSanitizer (TSan) support for detecting data races at runtime.
[Created by Copilot on behalf of @bghgary]
Changes
ENABLE_THREAD_SANITIZERoption with-fsanitize=thread. Includes a mutual exclusion check withENABLE_SANITIZERSsince TSan and ASan cannot be combined.enable-thread-sanitizerinput, passed asENABLE_THREAD_SANITIZERto CMake. WiresTSAN_OPTIONSwith suppression file and setsJSC_useConcurrentGC=0for the Run Tests step (see below).enable-thread-sanitizerinput; wiresTSAN_OPTIONS.Ubuntu_ThreadSanitizer_clangandmacOS_Xcode164_ThreadSanitizerjobs.called_from_lib:libjavascriptcoregtk). These are allocator-level races in JSC's JIT worker threads that we cannot fix.Linux TSan: Concurrent GC Workaround
JSC on Linux uses
pthread_kill(tid, SIGUSR1)+sem_waitinWTF::Thread::suspend()to stop mutator threads at GC safepoints. TSan's signal interception defers SIGUSR1 delivery indefinitely when the target is inside instrumented code; the handler never runs and the Collector Thread'ssem_waitdeadlocks. macOS JSC uses Machthread_suspend()(no Unix signals) and is unaffected.Setting
JSC_useConcurrentGC=0on the Linux TSan job removes the dedicated Collector Thread; GC runs on the mutator without cross-thread signals. Reproduced locally in WSL Ubuntu: default configuration hung 9/20 runs; withJSC_useConcurrentGC=0, 0/30 runs hung.Dependency Updates
microsoft/arcana.cpp(includes When using UWP and V8, V8 debugging doesn't work #61 — TSan-safe test hook callback mutex).BabylonJS/UrlLibatd251ad44(includes Fix data races in Apple WebSocket for TSan safety UrlLib#27 — Apple WebSocket@synchronizedfixes).Platform Support
TSan is supported on Linux and macOS with Clang/GCC. MSVC does not support TSan and Clang targeting Windows does not have a TSan runtime library.
CI Impact
The TSan jobs run in parallel with other jobs and do not increase overall pipeline time.
Status
called_from_lib:libjavascriptcoregtk; concurrent GC disabled to avoid TSan/signal deadlock